I came up with a new, better alternative to BNF and EBNF for a syntax specification syntax, and then I found out something similar already exists, called PEG. I also came up with a novel algorithm for parsing (which I haven’t implemented, which is why I’m creating this writeup instead) and then discovered that something similar already exists, known as an LR(0) parser, such as in this paper: https://www.cs.clemson.edu/course/cpsc827/material/LRk/LR0.pdf. In this writeup, I’ll explain my own special grammar, which I’ll call EPEG (extended parsing expression grammar), and my own special parser, which I call GraphParser.
The idea behind EPEG, as with PEG, is to bring regex-like grouping, repetition, and other such features into a parser grammar. EPEG has everything PEG has and a few more things:
- In addition to `?`, `+`, and `*`, there are `??`, `+?`, and `*?`, which capture the least amount possible, as they do in regex.
- There are `{x}` and `{x,y}`, as in regex, to repeat a clause x times or between x and y times.
- You can specify a rule name within another rule by using `:=`, as in `(BLAH* phew:=(FOO+ BAR))*?`. This would simply give the expression that matches `FOO+ BAR` the label `phew` in the parsed tree if it follows an expression matching `BLAH*`.
- You can specify a starting rule that will encompass the entire parsed text, for example by writing `start foo`. By default, the rule that encompasses the entire text is the first rule specified in the grammar.
- The specification should probably include backreferences to previously captured groups, using `\1`, `\2`, etc., like regex has. The problem with that, though, is that we'd have to have both capturing and non-capturing grouping like regex does, but the normal, non-capturing grouping used in PEG uses `()`, while `()` in regex is capturing and `(?:)` is non-capturing. I wouldn't want to use `(?:)` for non-capturing and `()` for capturing, because non-capturing would by far be the more popular usage, and `()` is simpler to type and to read. And I wouldn't want to swap the roles of `()` and `(?:)`, because that would be confusing. So I'd have to figure something out.
- I thought of allowing regex tokens to be specified in the same way as EPEG tokens, and even allowing regex inside EPEG rules, but I decided against it, as it would mean my EPEG parser would have to parse at the character-by-character level, which would introduce difficulties in both the implementation and the specification syntax, and would also be less efficient, at least if the parsing is done in a high-level language like I'd planned. Though if I weren't lazy, I could just create a unique token for each regex specification and include it in the lexer definition…
- There should be a way to specify "semantic actions" in both the parser and the lexer, but, IMO, the actual language used for the functions defined thereby should depend on the parser's/lexer's implementation and therefore should not be part of EPEG's specification. The syntax for associating the functions with the rules, though, should itself be part of the grammar. Maybe the syntax could be that the code goes inside `{}` right after the rule definition, as in Bison.
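To make these features concrete, here's a small, entirely hypothetical EPEG grammar of my own (all rule and token names are made up) showing `start`, `{x,y}`, `:=`, and a lazy repetition together:

```
start list
list = item{1,3} rest:=(SEMI item)*?
item = WORD | NUM
```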
My algorithm for the parser is…
- Convert the grammar to a graph.
For example, if the EPEG code is

```
foo = 'x' bar 'y'
bar = 'a'*
```

then the graph will be a node beginning a rule called `foo`, which points to a node containing `x`, which points to a node beginning a rule called `bar`, which points to a node containing `a`, which points to itself and also to a node containing `y`, which is understood by some denotation to be outside of the `bar` rule and to be another part of the `foo` rule.
- Start following symbols in the graph one at a time, while reading the input tokens in the text to be parsed and comparing them to the successive symbols.
- Wherever the graph splits, such as with the presence of a `|` ('or'), follow both lineages simultaneously. For example, in the graph described above, if you were parsing `xaay`, the symbol in the graph that matches the first `a` would point to itself, as well as to `y`, so the next `a` would be compared to both the `a` node and the `y` node. (This example shows the clever way in which `*` is to be implemented.)
- Poll the next symbol for each of the paths currently being followed, until one of the symbols matches the next token. The order in which the current paths are polled and compared to the next token depends on which rule has precedence in the grammar specification.
- Wherever one of the next symbols polled from the various paths doesn’t match the next token, drop that path from the list of paths currently being compared.
- If the next symbol in a path starts a rule and it matches the next token, then increment the stack pointer.
- Of course, if the next symbol is another rule specification, then poll the first symbol specified in that rule, i.e. the node that the rule's node points to. Or, if that node points to yet another rule, then poll the first node pointed to in that rule, etc.
- Of course, if the rule has a `|`, then split the paths and poll all the next symbols, which are the first symbols on either side of the `|`.
- If the next symbol in a path ends a rule and it matches the next token, then decrement the stack pointer, and add the name of the rule and the sequence composing it to the current stack level. The final output will be the sequence at `stack[0]`.
- Each token encountered and matched is appended to the string of tokens currently pointed to by the current stack pointer.
- If a token doesn’t match anything in any of the current paths being followed, then of course report that the input is invalid/doesn’t adhere to the specified grammar.
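As a sanity check of the loop described above, here's a minimal Python sketch for the example grammar `foo = 'x' bar 'y'`, `bar = 'a'*`, ignoring the rule stack and tree building. The node names and the dictionary encoding are my own, and note one detail the prose above glosses over: `x` also needs an edge past the `'a'*` directly to `y`, so that zero repetitions can match.

```python
# Minimal sketch of the multi-path matching loop for:
#   foo = 'x' bar 'y'
#   bar = 'a'*
# Rule-stack handling and tree building are omitted. "END" is a sentinel
# meaning the whole input was consumed successfully.
GRAPH = {
    "x": ("x", ["a", "y"]),  # 'a'* can match zero times, so 'x' also points past it to 'y'
    "a": ("a", ["a", "y"]),  # self-edge implements '*'; 'y' is the way out of the loop
    "y": ("y", ["END"]),
}

def matches(tokens):
    frontier = ["x"]                      # paths currently being followed
    for tok in tokens:
        next_frontier = []
        for name in frontier:
            if name == "END":
                continue                  # path already finished; extra tokens kill it
            label, successors = GRAPH[name]
            if label == tok:              # token matches this path's next symbol
                next_frontier.extend(successors)
        if not next_frontier:             # every path was dropped: invalid input
            return False
        frontier = next_frontier
    return "END" in frontier              # accept only if some path reached the end

print(matches(list("xaay")))  # True
print(matches(list("xy")))    # True  ('a'* matched zero times)
print(matches(list("xayx")))  # False
```

The frontier list plays the role of the "paths currently being compared": a mismatched path simply contributes nothing to the next frontier, which is the "drop that path" step.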
So, in the grammar specification given in the PDF linked to, which was…
```
S = A '#'
A = 'a' A 'b'
  | 'c'
```
it would…
1. Poll 'c' and 'A' from the grammar in concurrent lineages, where 'A' expands to 'a A b', so the current symbols you're comparing the next token to are 'c' and 'a'.
2. If the next token is 'c', collapse the 'A' rule and poll what follows it in the graph, the '#'.
3. If the next token after 'c' is '#', then end, because '#' is an end state in the graph. (If there are further tokens to be read, report invalid syntax.)
4. If the next token was 'a' instead of 'c', add one to the stack pointer, drop the 'c' lineage, and keep following the 'a A b' lineage, which means your next polled values are 'a' and 'c'.
5. If the next token is 'c', go to 2.
6. If the next token is 'a', go to 4.
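For instance, hand-working those steps on the input `aacbb#` (my own trace, not one from the linked paper), the set of polled symbols each token is compared against evolves like this:

```
token:    a      a      c      b      b      #
polled: {a,c}  {a,c}  {a,c}   {b}    {b}    {#}
```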
So, if, for example, you had two nested |s in the grammar, that would generate a parsing graph that would make the parser compare each input token to four different graph symbols until one matches, for as long as the input text falls under both nested |s.
In the `bar = 'a'*` rule above, the `a` node would point to itself first, then to the `y` node. But if the rule were `bar = 'a'*?`, the `a` node would point to the `y` node first, then to itself.
Similarly to how `'a'*` would result in a node pointing to `a` first and then to the next symbol, while the pointers in `'a'*?` would be in the reverse order, `'a'+?`, `'a'??`, and `'a'{2,3}?` vs. `'a'+`, `'a'?`, and `'a'{2,3}` would also work by reversing the orders of the pointers.
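If each node stores its outgoing pointers as an ordered list, the greedy-versus-lazy distinction reduces to reversing that list. A trivial sketch, assuming that hypothetical list-of-successors representation:

```python
# Hypothetical successor lists for the 'a' node in `bar = 'a'*` vs `bar = 'a'*?`.
# The parser polls successors in order, so the order encodes match preference.
greedy_a = ["a", "y"]              # 'a'*  : try repeating 'a' before moving on to 'y'
lazy_a = list(reversed(greedy_a))  # 'a'*? : try moving on to 'y' first

print(lazy_a)  # ['y', 'a']
```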
`'a'+` would be implemented the same way as `a` followed by `'a'*`.
`'a'{2,3}` would be implemented in the same way as `(aa)|(aaa)` would.
`'a'? 'b'` would be implemented by having the node before the `a` node point to both the `a` node and the `b` node, so that the `a` can be skipped.
`!('a' foo)` would be implemented with a special node type starting the pattern that tells the parser it's a 'not', so that if the entire pattern is exhausted while matching the input data, the parsing fails with a syntax error. The first node of the pattern would also point to what comes after the end of the pattern, so that once the matching fails, it can follow that pointer to continue.
`&('a' foo)` would be implemented with a special node type starting the pattern that tells the parser it's an 'and', so that if the entire pattern is not exhausted while matching the input data, the parsing fails with a syntax error. The first node of the pattern would also point to what comes after the end of the pattern, so that once the matching succeeds, it can follow the pointer to continue.
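As a rough illustration of the 'not' case (the 'and' case is the same with the condition flipped), here's a hedged sketch. `match_pattern` is a hypothetical stand-in for whatever callable the real parser would use to try the bracketed sub-pattern at the current position:

```python
# Sketch of a 'not' predicate node: try the bracketed sub-pattern at the
# current position; if it matches, the parse fails; otherwise follow the
# pointer past the pattern and continue without consuming any input.
def not_predicate(match_pattern, tokens, pos):
    if match_pattern(tokens, pos):
        raise SyntaxError(f"negative lookahead matched at position {pos}")
    return pos  # '!' consumes nothing; resume at the node after the pattern

# Hypothetical matcher standing in for the inner pattern of !('a' foo):
starts_with_a = lambda toks, i: i < len(toks) and toks[i] == "a"

print(not_predicate(starts_with_a, list("bcd"), 0))  # 0 (continue parsing here)
```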
I tried implementing this parsing algorithm in Python, but my brain kept melting while trying to pre-code the parsed tree for the specification of EPEG so that I could convert it to a graph so that I could parse EPEG grammars using the graph so that I could convert them to graphs so that I could parse code input based on specified EPEG grammars. So, somebody else will have to do it…
